Skip to content

fix: suppress net.ErrClosed on concurrent Close#564

Draft
mafredri wants to merge 1 commit into
masterfrom
mafredri/fix-close-race-net-errclosed
Draft

fix: suppress net.ErrClosed on concurrent Close#564
mafredri wants to merge 1 commit into
masterfrom
mafredri/fix-close-race-net-errclosed

Conversation

@mafredri
Copy link
Copy Markdown
Member

@mafredri mafredri commented May 22, 2026

Close and CloseNow returned a wrapped net.ErrClosed when the internal read loop or write path won the casClosing race before the user-facing call. The defer that suppresses net.ErrClosed was only registered on the CAS winner path, so the first user call to Close could surface "use of closed network connection" even though the close completed successfully.

This separates user-facing close tracking (userClosed atomic) from internal close coordination (casClosing). Combined with isClosed, this gives three distinct behaviors:

  • First user call, close in progress or peer already closed: nil.
  • Concurrent user calls while close is in progress: nil, returns instantly without blocking.
  • User call after close has fully completed: net.ErrClosed.

Refs coder/internal#1541

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

@mafredri mafredri force-pushed the mafredri/fix-close-race-net-errclosed branch from 07d6409 to b76fd5d Compare May 22, 2026 11:05
Close and CloseNow returned a wrapped net.ErrClosed when the
internal read loop or write path won the casClosing race before
the user-facing call. The defer that suppresses net.ErrClosed
was only registered on the CAS winner path.

Separate user-facing close tracking from internal close
coordination. A userClosed atomic tracks whether the user has
called Close/CloseNow before. Combined with isClosed, this
gives three distinct behaviors:

- First user call, close in progress or peer closed: nil.
- Concurrent user calls while close is in progress: nil,
  returns instantly without blocking.
- User call after close has fully completed: net.ErrClosed.
@mafredri mafredri force-pushed the mafredri/fix-close-race-net-errclosed branch from b76fd5d to e5aa423 Compare May 22, 2026 11:13
@mafredri
Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review
Copy link
Copy Markdown

coder-agents-review Bot commented May 22, 2026

Review posted | Chat
Requested: 2026-05-22 14:18 UTC by @mafredri

Review history
  • R1 (2026-05-22): 12 reviewers, 2 Nit, 2 Note, 1 P2, 3 P3, COMMENT. Review

deep-review v0.5.0 | Round 1 | d099e16..e5aa423

Last posted: Round 1, 8 findings (1 P2, 3 P3, 2 Nit, 2 Note), COMMENT. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P3 Open close.go:92 Doc comment claims "Additional calls to Close are no-ops" but post-close calls return net.ErrClosed R1 Netero, Pariston Yes
CRF-2 Note Dropped by orchestrator (pre-existing, outside diff) close_test.go:49 Pre-existing tc := tc in 4 range loops, unnecessary since Go 1.22 R1 Netero No
CRF-3 P2 Open close.go:102 No regression test for the race condition this PR fixes R1 Bisky P1, Hisoka P2, Mafu-san P2, Mafuuu P2, Meruem P2, Chopper P2, Pariston P3, Leorio P3, Takumi P3 Yes
CRF-4 P3 Open close.go:130 CloseNow doc comment says nothing about repeated or concurrent call behavior R1 Leorio Yes
CRF-5 Nit Open close.go:111 PR description claims concurrent calls "return instantly without blocking" but they block on waitGoroutines for up to 15s R1 Pariston Nit, Mafuuu Note, Takumi Note Yes
CRF-6 P3 Open conn.go:80 userClosed naming: reads as completion (like closed) but tracks initiation; also set by internal library code via CloseRead R1 Gon P3, Hisoka Note, Chopper Note Yes
CRF-7 Nit Open conn.go:80 Comment "on first user call" restates what name and type already show R1 Gon P2, Leorio Nit Yes
CRF-8 Note Open conn.go:80 Close state tracked across three independent boolean-ish variables forming implicit state machine R1 Meruem Yes
CRF-9 Note Open close.go:102 Non-atomic gap between userClosed.Swap and isClosed is benign; system converges to correct state R1 Takumi Yes

Contested and acknowledged

(none)

Round log

Round 1

Panel. 1 P2, 3 P3, 2 Nit, 2 Note new. 1 pre-existing Note dropped. Reviewed against d099e16..e5aa423. Panel: Bisky, Hisoka, Mafu-san, Mafuuu, Pariston, Ging-Go, Gon, Leorio, Takumi, Meruem, Killua, Chopper.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean diagnosis, proportional fix. The userClosed atomic correctly separates user intent from internal close coordination, and both Close and CloseNow are updated symmetrically. The commit message is the standout: it names the condition, the mechanism, and the fix, then enumerates all three behavioral states.

The dominant finding is the missing regression test. Nine of twelve reviewers flagged it independently. The race this PR fixes (internal read loop wins casClosing before the user calls Close) has no test coverage, and no existing test distinguishes the old behavior (wrapped net.ErrClosed) from the new behavior (nil). A bug fix without a test for the bug is incomplete work.

Two doc comments need attention: Close's comment claims repeated calls are "no-ops" (they return net.ErrClosed post-close), and CloseNow's comment is silent on the topic entirely.

Severity breakdown: 1 P2, 3 P3, 2 Nit, 2 Note.

"I tried to build a simpler alternative using only isClosed() as the early guard, but it returns net.ErrClosed on a first user call when the internal close already completed, which is wrong from the user's perspective: their first Close should succeed regardless of internal state. The userClosed separation is the minimal mechanism that preserves that invariant." (Pariston)


close.go:92

P3 [CRF-1] Doc comment claims "Additional calls to Close are no-ops" but the new behavior returns net.ErrClosed after the connection is fully closed. The three behaviors in the commit message (first call nil, concurrent calls nil, post-close net.ErrClosed) are the actual contract; the doc should match. (Netero, Pariston)

🤖

close.go:130

P3 [CRF-4] CloseNow's doc comment says nothing about repeated or concurrent call behavior. The three behaviors in the commit message apply to CloseNow identically, but someone writing defer c.CloseNow() that might race with another goroutine has no way to know from the godoc what happens. (Leorio)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread close.go
if err != nil {
return err
}
if c.userClosed.Swap(true) && c.isClosed() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 [CRF-3] No regression test for the race condition this PR fixes. The bug: when the internal read loop (read.go:361) or write path (write.go:365) wins casClosing() before the user calls Close, the old code returned a wrapped net.ErrClosed. No existing test exercises this interleaving.

The existing TestConcurrentClosePing exercises concurrent ping+close but not any close+close path. TestConnClosePropagation exercises close propagation but not the return value contract for the caller that loses the casClosing race. (Mafuuu)

I verified this by reverting the PR changes and running all close-related tests 10 times with the race detector on the old code. All passed. (Pariston)

A test that reproduces this: set up a connection with CloseRead, trigger the peer to send a close frame so the local read loop wins casClosing, then call Close from user code and assert nil return. (Bisky P1, Hisoka P2, Mafu-san P2, Mafuuu P2, Meruem P2, Chopper P2, Pariston P3, Leorio P3, Takumi P3)

🤖

Comment thread close.go
}
}()

if c.casClosing() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-5] The PR description claims concurrent user calls "return instantly without blocking." In practice, a concurrent caller reaches casClosing() -> true, enters waitGoroutines(), and blocks on c.closeReadDone and c.closed with a 15-second timer. The blocking is correct behavior (caller gets confirmation that close completed), but the description overstates it. (Pariston, Mafuuu, Takumi)

🤖

Comment thread conn.go
closing atomic.Bool
closeMu sync.Mutex // Protects following.
closed chan struct{}
userClosed atomic.Bool // Set by Close/CloseNow on first user call.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-6] userClosed reads as completion alongside closed, but it tracks initiation. The struct now has three close fields in near-identical tenses (userClosed, closing, closed) where the first means "called" and the last means "done." Also, CloseRead's goroutine (read.go:82) calls the public c.Close(), setting userClosed from internal library code. closeRequested would disambiguate stage and scope without reading the body. (Gon P3, Hisoka Note, Chopper Note)

🤖

Comment thread conn.go
closing atomic.Bool
closeMu sync.Mutex // Protects following.
closed chan struct{}
userClosed atomic.Bool // Set by Close/CloseNow on first user call.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-7] Comment "on first user call" restates what the name and type already show. The user is in the name, first call is derivable from atomic.Bool set once. Trim to // Set by Close/CloseNow. or // Tracks whether the user has called Close/CloseNow. (Gon, Leorio)

🤖

Comment thread conn.go
closing atomic.Bool
closeMu sync.Mutex // Protects following.
closed chan struct{}
userClosed atomic.Bool // Set by Close/CloseNow on first user call.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note [CRF-8] Close state is now tracked across three independent boolean-ish variables (userClosed, closing, closed channel) forming an implicit state machine with ~5 reachable states and undocumented transitions. The current approach works correctly and is proportionate for a bugfix, but the next person who modifies close coordination must understand all three in combination. (Meruem)

🤖

Comment thread close.go
if err != nil {
return err
}
if c.userClosed.Swap(true) && c.isClosed() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note [CRF-9] The userClosed.Swap(true) && c.isClosed() check is not atomic (the connection could finish closing between the swap and the channel probe), but the gap is benign. A second user call that falls through instead of returning net.ErrClosed immediately reaches casClosing() -> waitGoroutines(), the defer suppresses net.ErrClosed, and the caller gets nil. The system converges to the correct state on both sides of the gap. (Takumi)

🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant